A reproducible document is a file or set of files, typically in a scientific or data-driven context, that includes both the content (text, tables, figures) & the code or instructions required to generate and update that content.
It is designed to ensure that others can reproduce the same document, including its data analysis, results, and visualizations, consistently & accurately.
Reproducible documents are essential for transparent & verifiable research, allowing others to verify and build upon the work.
Literate programming is a coding & documentation approach where code and explanations are combined in a single document.
Emphasizes clear and understandable code by interleaving human-readable text (explanations, comments, and documentation) with executable code.
Fosters better communication and understanding among programmers.
5.2 Literate programming
Document a program by
Writing explanations in a natural language, interspersed with
Snippets of code written in a programming language
Two operations on this file:
Weaving: Creating a document incorporating the code and its results
Tangling: Extract the code into a script file for running
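As a rough illustration of tangling, the sketch below pulls the code chunks out of a small literate document and joins them into a script. The names `tangle`, `CHUNK_PATTERN`, and the sample document are hypothetical, and the chunk header format is an assumption modeled on the fenced-chunk syntax used by RMarkdown and Quarto; real tools handle many more cases.

```python
import re

FENCE = "`" * 3  # three backticks, built in code so the example stays readable

# Assumed chunk layout: a fence followed by {language ...}, the code,
# then a closing fence on its own line.
CHUNK_PATTERN = re.compile(
    "^" + FENCE + r"\{(?P<lang>\w+)[^}]*\}\n(?P<code>.*?)^" + FENCE + r"\s*$",
    re.MULTILINE | re.DOTALL,
)


def tangle(source: str, language: str) -> str:
    """Return only the code from chunks written in `language`."""
    chunks = [
        match.group("code").rstrip("\n")
        for match in CHUNK_PATTERN.finditer(source)
        if match.group("lang") == language
    ]
    return "\n\n".join(chunks) + "\n" if chunks else ""


# A hypothetical literate document: prose interleaved with two code chunks.
document = "\n".join([
    "We first load the data.",
    "",
    FENCE + "{python}",
    "values = [1, 2, 3]",
    FENCE,
    "",
    "Then we summarise it.",
    "",
    FENCE + "{python}",
    "#| echo: false",
    "total = sum(values)",
    FENCE,
    "",
])

script = tangle(document, "python")
print(script)
```

The prose is dropped and only the chunk bodies survive, so the result can be run as an ordinary script. Weaving goes the other way: it runs the chunks and places their results back into the text.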
In the R ecosystem, this was first implemented as Sweave
Uses LaTeX to provide the mathematical & textual bits (we’ll talk about LaTeX soon)
The code bits are in R
Superseded by rmarkdown and knitr
Superseded by Quarto in both the R and Python ecosystems
Integrates with Jupyter notebooks using polyglot cells
5.3 Reproducible documents
Adapting the literate programming method to creating documents
Put text and code in the same document, with access to the data from the code
Weave the document so that the results of your analysis are incorporated into the document
Given the same data and code, you’ll create the same document (reproducible)
You can look at the source file to see how the results presented are generated
Potentially you can use the same source to create many different kinds of outputs
The main purpose here is to create documents, slides, websites, blogs, books etc.
The code is a means to creating output for these products, rather than just documenting code
5.4 Tool-sets
Reproducible documents are a marriage of natural language text (usually w/ some markup) & a scripting language.
Assuming you already have R, Python, and/or Anaconda installed on your computer, including the standard packages, you will need:
Python or Anaconda
R
Jupyter
Quarto
RStudio, VS Code, Antigravity
Google Colab can be an alternative
It will help you to install the necessary packages and libraries for the course.
5.6 Markdown
Optional content: Either cover quickly or skip, students can review at their pace. As simple as text documents.
5.6.1 Markdown overview
Markdown is a lightweight markup language used for formatting plain text, enabling easy conversion to HTML. Popular for web content & documentation due to simplicity.
Headers: Create headings with ‘#’ (e.g., ‘# Header 1’ for the largest heading).
Emphasis: Use `*` or `_` for italics (`*italic*`) and `**` or `__` for bold (`**bold**`).
Lists: Create ordered (1. Item) and unordered (- Item) lists.
Links: Insert links with [text](URL).
Images: Embed images with `![alt text](URL)`.
Blockquotes: Quote text with ‘>’ (e.g., ‘> This is a quote’).
Inline-Code: Format inline code with backticks (e.g. `code`)
Code blocks: Multi-line code blocks are wrapped in triple backticks (e.g. ```code```).
Horizontal Rule: Add a horizontal line with `---` or `___`.
Tables: Create tables using ‘|’ and ‘-’ (see Markdown table syntax).
Escape Characters: Use ‘\’ to escape special Markdown characters.
5.7 LaTex
Optional content: Either cover quickly or skip, students can review at their pace.
5.7.1 LaTex Overview
LaTeX is a typesetting system commonly used for creating documents with complex formatting, such as research papers, theses, and academic articles.
Uses plain text w/ markup commands to define document structure & formatting.
Focuses on content & structure, allowing users to separate content from presentation.
Highly customizable, w/ packages & templates available for various document types.
LaTeX is particularly popular in academia & scientific publishing due to its support for mathematical equations and references.
Documents are compiled into PDF w/ LaTeX compilers like pdflatex or xelatex.
LaTeX is free and open-source, compatible with Windows, macOS, and Linux, and widely used in technical and scientific fields.
We often use Markdown for document generation rather than LaTeX. You DON'T have to be a LaTeX expert, but familiarity w/ writing math in LaTeX is important.
5.7.2 LaTex Motivation
It’s worth learning a little bit of LaTeX both for its mathematical typesetting and for its strong formatting capabilities.
It is common for universities, journals, and publishers to provide LaTeX document classes for typesetting in their preferred manner.
\left(...\right): Automatically size parentheses and brackets.
\alpha, \beta, etc.: Use Greek letters.
^ and _: Superscripts and subscripts, e.g., `x^2` \(\rightarrow x^2\) and `a_{ij}` \(\rightarrow a_{ij}\).
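The three constructs above can be combined in a single display equation. The variance formula below is only an illustrative choice (it is not taken from the course material); it uses Greek letters, subscripts and superscripts, and automatically sized parentheses:

```latex
$$
\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \mu \right)^2
$$
```

Here `\sigma` and `\mu` are Greek letters, `x_i` and `^2` show a subscript and a superscript, and `\left( ... \right)` sizes the parentheses to fit their contents.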
5.7.6 LaTex installation
Most reproducible document systems require an installation of LaTex on your computer or system. In particular, both RMarkdown and Quarto use it to generate PDF documents.
There are several distributions of LaTeX available, including TeX Live, MiKTeX, and MacTeX
However, there is a smaller, essential LaTeX distribution available through the tinytex package, based on the TeX Live distribution. Quarto can install it with:
quarto install tinytex
This installation remains local to Quarto, i.e., it doesn’t affect other apps on your computer that use TeX. To make this your default TeX installation on your computer, use
quarto install tinytex --update-path
See the Quarto documentation on PDF Engines, and here for other Quarto tools.
Mathpix is a tool that uses optical character recognition to convert images of handwritten or printed math equations into digital formats, aiding quick integration of math content into documents and applications.
5.9 R-Markdown
Optional content: Either cover quickly or skip, students can review at their pace.
5.9.1 RMarkdown
Markdown is a text-to-HTML conversion tool meant for web writers to write HTML pages without having to deal with all the syntax
It allows the use of LaTeX for math typesetting
RMarkdown creates a noweb-like system for creating Markdown documents using R
RMarkdown was preceded by Yihui Xie’s knitr package, & still has most of its capabilities
This is still inspired by Sweave
Text is written in Markdown markup
Code chunks are delimited by ```{r} in line with Markdown syntax
A richer set of options for chunks
5.9.2 Pandoc
Pandoc is a “universal document converter” that has taken Markdown-based pipelines to the next level by allowing different kinds of outputs
Note the breadth of available formats for conversion
For us, the most useful are
HTML
Word/Powerpoint
5.9.3 RMarkdown document
Andrew Bray, rstudio::conf(2022)
5.10 RMarkdown
RMarkdown has a strong ecosystem around it due to its extensibility.
Documents
Interactive documents (including dashboards and Shiny apps)
Presentations
Books (see bookdown.org for many great R books freely available)
Websites
Journal templates (using the rticles package)
5.11 Jupyter
Optional content: Either cover quickly or skip, students can review at their pace.
5.11.1 Jupyter Lab & Notebook
JupyterLab is a web-based interactive user interface
Common in Kaggle competitions, & a common analysis interface built on literate programming
Built into many cloud platforms (Google Colab, Databricks, AWS SageMaker)
Notebooks can be read, edited and run directly in VS Code or PyCharm without starting a web service
… But
But some would call the worry about cells being run out of order a red herring, since you can always use “Run all cells” in a notebook to do things in the right order
5.12 Quarto
Quarto is the next generation of RMarkdown developed by Posit
5.12.1 Quarto overview
Quarto is a modern publishing system designed for data science and technical communication, focusing on reproducibility, interactivity, and flexibility:
Markdown-Based: Utilizes Markdown for writing content, embedding code, and creating interactive documents.
Data-Driven: Supports integration of code & data to generate dynamic, up-to-date reports and documents.
Reproducibility: Ensures the ability to recreate documents with consistent results using source code and data.
Customizable Outputs: Generates various outputs like HTML, PDF, and more, allowing customization for different needs.
Interactive Elements: Enables creation of interactive charts, tables, and visualizations directly within documents.
Extensible: Offers extensibility via plugins with tools like R, Python, and Jupyter.
Allows more complex HTML structures like div blocks, tabsets, callouts, etc.
RMarkdown documents based on pandoc will render in quarto
xaringan, a popular presentation package based on RMarkdown, renders directly to remark.js and so is not compatible with Quarto
We can now use Jupyter notebooks as a source for Quarto, allowing the production of Python-based documents with the same fidelity as R users experience.
In fact, you can use any available Jupyter kernel for the source Jupyter notebook, not just Python
5.14 RMarkdown & Quarto
5.15 Single source, many outputs
We can create content (text, code, results, graphics) within a source document, and then use different weaving engines to create different document types
Documents
Web pages (HTML)
Word documents
PDF files
Presentations
HTML
PowerPoint
Websites/blogs
Books
Dashboards
Interactive documents
Formatted journal articles
Most RMarkdown documents readily render in Quarto
5.16 Installation
Quarto is a standalone app on your computer, not an R or Python package
Keeping RMarkdown for lots of things, since it still works
Ability to create Python-based documents (great for ML projects)
Love some new things for presentations
The main story is that there is nothing wrong with either RMarkdown or Jupyter per se, but Quarto enhances the feature set and gives Python a proper reproducible document tool. RMarkdown is still richer.
The major enhancement over both is in creating presentations, and the ease of constructing particular layouts
5.18 Aside: Project website and presentation
It is a COURSE REQUIREMENT that you build your website and presentation with Quarto
It is highly recommended that you use .qmd files and NOT .ipynb files for the website building.
However, it is easy to convert between the two formats by using the quarto convert filename command
Functionally the two formats are basically identical, i.e. they’re just Markdown + Code
However, there is ONE MAJOR DIFFERENCE, i.e. .ipynb stores the code outputs inside the file itself, alongside the code
This means you ONLY HAVE TO RUN THE CODE ONCE with .ipynb
.qmd will run the code every time you build the report, which can be very slow
There are caching options for .qmd to avoid this; however, they are “messier” than just using .ipynb
Note: If .qmd has no code, then it is basically just a Markdown file .md
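To make the difference concrete, the sketch below builds a tiny hand-written notebook in the JSON layout that .ipynb files use and reads back the saved output without running any code. The notebook content and the helper name `stored_outputs` are hypothetical; only stream (printed) outputs are handled here, as a simplifying assumption.

```python
import json

# A minimal, hand-written notebook. Real notebooks carry more metadata.
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["print(1 + 1)"],
            # The result of the last run is saved next to the code.
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
})


def stored_outputs(notebook_text: str) -> list[str]:
    """Collect the text of saved stream outputs from every code cell."""
    notebook = json.loads(notebook_text)
    texts = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue
        for output in cell.get("outputs", []):
            if output.get("output_type") == "stream":
                texts.append("".join(output["text"]))
    return texts


print(stored_outputs(notebook_json))
```

A .qmd file has no equivalent field: it holds only text and code, so the results have to be produced again (or read from a cache) each time the document is rendered.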
5.19 Using Quarto
Tour through various important quarto constructs
5.19.1 LaTex in quarto
Since Quarto is built on top of Markdown, coding in Quarto should be pretty intuitive
5.19.1.1 Quarto syntax
- This is an example of a bulleted list with math
- Here is an in-line math equation $f(x)=\frac{e^{x^2}}{2}$
$$ g(x)=x^n \rightarrow \frac{\partial g}{\partial x}=n x^{n-1} $$
$$
\begin{aligned}
g(x) &= x^n \rightarrow \frac{\partial g}{\partial x} = n x^{n-1} \\
h(x) &= \int x^n \, dx \rightarrow h(x) = \frac{x^{n+1}}{n+1}
\end{aligned}
$$
5.19.1.2 Result
This is an example of a bulleted list with math
Here is an in-line math equation \(f(x)=\frac{e^{x^2}}{2}\)
#| layout-ncol: 2
#| fig-cap:
#|   - Speed and Stopping Distances of Cars
#|   - Vapor Pressure of Mercury as a Function of Temperature
#| code-fold: false
#| vscode: {languageId: r}
#| eval: false
#| echo: true
plot(cars)
plot(pressure)
5.19.7 Code highlighting
#| layout-ncol: 2
#| label: fig-charts
#| fig-cap: Charts
#| fig-subcap:
#|   - Speed and Stopping Distances of Cars
#|   - Vapor Pressure of Mercury as a Function of Temperature
#| code-fold: false
#| code-line-numbers: '|3|4-7'
#| vscode: {languageId: r}
#| eval: false
#| echo: true
plot(cars)
plot(pressure)
5.19.8 Converting file types
You can switch between .qmd and .ipynb with Quarto
quarto convert clustering.qmd: this will output a .ipynb version called clustering.ipynb
quarto convert eda.ipynb: this will output a .qmd version called eda.qmd
Examples:
quarto convert filename.ipynb
quarto convert filename.qmd
quarto convert filename.rmd
quarto preview filename.qmd
quarto preview filename.ipynb
quarto render filename.qmd
quarto render filename.ipynb
5.19.9 Embedding video
{{< video https://youtu.be/Z8t4k0Q8e8Y height="400" >}}
YAML (YAML Ain’t Markup Language): A human-readable data serialization format.
Indentation: Uses whitespace indentation (spaces only; tab characters are not allowed) for structure, without relying on explicit delimiters like braces or brackets.
Key-Value Pairs: Represents data as key-value pairs, e.g., key: value.
Lists: Represents lists using hyphens, e.g., - item1, - item2.
Nested Structures: Supports nested data structures using indentation.
Comments: Allows comments with #.
Scalars: Represents simple data types like strings, numbers, and booleans.
Data Types: Supports various data types including strings, numbers, booleans, null, and dates.
Inclusion: Permits the use of anchors and aliases for reusing data.
Readability: Prioritizes human readability and is often used in configuration files and data exchange between languages.
No Code Execution: YAML is not intended for executing code, making it safer than some other formats for data exchange.
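As a rough picture of the YAML data model, the sketch below shows a small hypothetical YAML document in a comment and, next to it, the nested Python structure a YAML parser would typically produce for it. The structure is written by hand here rather than parsed, so no YAML library is assumed; the keys and values are illustrative only.

```python
import json

# Hypothetical YAML text, shown only for comparison:
#
#   title: Report          # key-value pair (string scalar)
#   draft: false           # boolean scalar
#   authors:               # list, one hyphen per item
#     - Ada
#     - Grace
#   format:                # nested structure, expressed by indentation
#     html:
#       toc: true
#
# Hand-written equivalent: mappings become dicts, lists become lists,
# and scalars become strings, numbers, or booleans.
config = {
    "title": "Report",
    "draft": False,
    "authors": ["Ada", "Grace"],
    "format": {"html": {"toc": True}},
}

# Nested values are reached by following the indentation levels.
print(config["format"]["html"]["toc"])
print(json.dumps(config, indent=2))
```

Each extra level of indentation in the YAML text corresponds to one more level of nesting in the resulting structure.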
5.19.11 Aside: Tip for Mac users
command+control+shift+4 is very useful on a mac.
It takes a screenshot and saves it to the clip-board
Windows has a similar shortcut but you will have to google it
The following VSC extension allows you to paste images from the clip-board with alt+command+v.
tab is your best friend when using the command line, since it does auto-completion
open ./path_to_file will open any file or directory from the command line
5.20 Citations and Bibliographies
5.20.1 Citations
When creating documents, it is imperative that we properly cite and attribute sources we use in our document
This is part of the Honor Code for academic integrity
This is about being honest with intellectual property & giving credit where credit is due.
Keeping track of sources and citations often ends up being a big deal
5.20.2 Reference management
There are several established reference managers available, both commercial and open-source.
Many of these can search online resources to add references and PDFs
Many have cloud storage allowing you to access your references from anywhere
They can insert citations and create bibliographies in documents
5.20.3 BibTeX
BibTeX is a tool and file format that is used to describe and process lists of references. It was developed alongside LaTex.
BibTeX works especially well with the Markdown/Pandoc workflow of Quarto
It also works with Word using the BibTeX4Word plugin
A typical BibTeX entry looks like this:
@article{article_key,
  author  = {Peter Adams},
  title   = {The title of the work},
  journal = {The name of the journal},
  year    = 1993,
  number  = 2,
  pages   = {201-213},
  month   = 7,
  note    = {An optional note},
  volume  = 4
}
The crucial piece in this entry is the article_key by which you will refer to the citation in your document
There are 14 standard BibTeX entry types, but the most common are @article, @book and @unpublished
However, you don’t need to write out these entries by hand
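Since the citation key is what ties a document to its references, it can help to list the keys a .bib file defines. The sketch below is a minimal, regex-based illustration, not a full BibTeX parser; the pattern `ENTRY_START`, the helper `citation_keys`, and the second entry (`book_key`) are hypothetical.

```python
import re

# Matches the start of an entry, e.g. "@article{article_key,".
# Assumption: one entry per "@type{key," header, with no nested oddities.
ENTRY_START = re.compile(r"@(?P<type>\w+)\s*\{\s*(?P<key>[^,\s]+)\s*,")

# Placeholder entries in the same spirit as the example above.
bibtex_text = """
@article{article_key,
  author = {Peter Adams},
  title  = {The title of the work},
  year   = 1993
}
@book{book_key,
  author = {Author Name},
  title  = {The title of the book},
  year   = 2000
}
"""


def citation_keys(text: str) -> dict[str, str]:
    """Map each citation key to its entry type (lower-cased)."""
    return {
        match.group("key"): match.group("type").lower()
        for match in ENTRY_START.finditer(text)
    }


print(citation_keys(bibtex_text))
```

For real bibliographies, a reference manager or a dedicated BibTeX library is a safer choice than a hand-written pattern.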
5.20.4 Reference managers
Several reference managers are available that provide citations in BibTeX format, among other formats
To see some comparisons between these and a popular commercial reference manager, see here
5.20.5 Citations in Quarto
There are a few citation formats that can be used with Quarto, through pandoc.
| Format | File extension |
|---|---|
| BibLaTeX | .bib |
| BibTeX | .bibtex |
| CSL JSON | .json |
| CSL YAML | .yaml |
| RIS | .ris |
Note
The RIS format is available from many reference managers, including Endnote
The CSL JSON format is also an available format from many reference managers
Pandoc has the ability to automatically generate citations and a bibliography from BibTeX files and embedded references (much like we replace R/Python code with their output).
Point to .bib files in the YAML header, replacing the name w/ the file-name you’re using
You would need to download the .csl file from the repository and keep it in the same folder as your document.
You make a citation using the standard Pandoc/BibTeX syntax ([@citation]), where these are the citation-keys we saw before in the BibTeX entries.
You can use multiple citations at a time, separated by semi-colons
Pandoc will generate a bibliography and place it in the document. It will be placed in a div with the id refs if one exists; otherwise it will be placed at the end of the document
### References
::: {#refs}
:::
Depending on the CSL format, the bibliography will conform to the ordering and format of that CSL specification.
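A citation only resolves if its key exists in the bibliography file. The sketch below scans text for bracketed `[@key]` citations, including several keys separated by semi-colons, and reports any key that is not in a known set. It is a simplified illustration: the patterns cover only common key characters, and `cited_keys`, `known_keys`, and the sample text are hypothetical.

```python
import re

# A bracketed group that contains at least one "@", e.g. [@a; @b, p. 12].
BRACKETED = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")
# Simplified key pattern: letters, digits, underscore, colon, dot, hyphen.
KEY = re.compile(r"@([\w:.-]+)")


def cited_keys(markdown_text: str) -> list[str]:
    """Return citation keys in order of appearance."""
    keys = []
    for group in BRACKETED.findall(markdown_text):
        for part in group.split(";"):  # multiple citations in one bracket
            match = KEY.search(part)
            if match:
                keys.append(match.group(1))
    return keys


text = "Earlier work [@article_key; @book_key, p. 12] and [@missing_key] disagree."
known_keys = {"article_key", "book_key"}  # e.g. the keys defined in the .bib file

cited = cited_keys(text)
missing = [key for key in cited if key not in known_keys]
print(cited)
print(missing)
```

A check like this can catch misspelled keys before rendering.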
Citation styles
American Statistical Association
Nature
Chicago
5.21 Supplementary material
Optional content: Either cover quickly or skip, students can review at their pace.
5.21.1 Using Quarto in RStudio
Given that Quarto is developed by Posit, the company behind RStudio, Quarto works quite seamlessly in RStudio
RStudio provides both a source editor and a visual editor experience for editing Quarto files
The visual editor in RStudio does not work really well with more complex constructs like columns, callouts and tabsets. It tends to change the source code leading to problems
At the top of the file, we put an (optional) YAML header to provide information about document output, general formatting rules and the like.
You can also specify the computational engine you want to use for the Quarto document in the YAML header (only if you’re using a non-R engine via Jupyter)
You specify the Jupyter kernel you want to use as well
Important
Note that YAML is a hierarchical specification with hierarchy denoted by indentation. This is important to maintain, so your specification is clear. Note, for example, that the jupyter option is a top-level option, while the html and pdf specifications are within the format specification
The R engine will then use the anly503 conda environment for all Python chunks.
It will also run all the chunks in the same Python instance, so you can pass data from one chunk to the next
The Jupyter engine does not allow multiple engines to run concurrently, so rendering both R and Python chunks in the same Jupyter document doesn’t work
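Running every chunk in the same Python instance means later chunks can see what earlier ones defined. The toy model below imitates this by executing chunk strings one after another in a single shared namespace; it is a conceptual sketch with hypothetical names, not how the rendering engines are implemented.

```python
# Two hypothetical chunks, in document order.
chunks = [
    "data = [3, 1, 2]",
    "ordered = sorted(data)",  # relies on `data` from the previous chunk
]

# One namespace shared by every chunk, standing in for a single session.
shared_namespace = {}
for chunk in chunks:
    exec(chunk, shared_namespace)

print(shared_namespace["ordered"])
```

If each chunk were run in a fresh namespace instead, the second chunk would fail because `data` would not be defined.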
5.23.4 Tangling
Sometimes it is useful to extract the actual code from the Quarto document so that it can just run without the overhead of processing the full document or notebook. This is called tangling
This feature is currently not implemented in Quarto
For R/knitr-based quarto documents, one can use knitr::purl